Robust experimentation in the continuous time bandit problem
Authors
Abstract
Similar resources
Continuous Time Associative Bandit Problems
In this paper we consider an extension of the multiarmed bandit problem. In this generalized setting, the decision maker receives some side information, performs an action chosen from a finite set and then receives a reward. Unlike in the standard bandit settings, performing an action takes a random period of time. The environment is assumed to be stationary, stochastic and memoryless. The goal...
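The truncated abstract stops before stating the objective, but in a setting where each pull takes a random amount of time, a natural goal is to maximize long-run reward per unit time. A minimal sketch of an index rule for that setting, assuming reward-rate maximization and ignoring the side information for brevity (this illustrates the problem, not the paper's algorithm; RateUCB and its bonus term are my own naming):

import math

class RateUCB:
    # Hypothetical UCB-style rule for arms whose pulls take random time:
    # score each arm by its estimated reward rate (mean reward per unit
    # of elapsed time) plus an exploration bonus. Illustrative only.
    def __init__(self, n_arms):
        self.pulls = [0] * n_arms
        self.reward = [0.0] * n_arms
        self.time = [0.0] * n_arms

    def select(self, t):
        # Pull each arm once before trusting the estimates.
        for arm, count in enumerate(self.pulls):
            if count == 0:
                return arm
        def score(arm):
            rate = self.reward[arm] / self.time[arm]
            bonus = math.sqrt(2.0 * math.log(t) / self.pulls[arm])
            return rate + bonus
        return max(range(len(self.pulls)), key=score)

    def update(self, arm, reward, duration):
        self.pulls[arm] += 1
        self.reward[arm] += reward
        self.time[arm] += duration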
On the Optimal Reward Function of the Continuous Time Multiarmed Bandit Problem
The optimal reward function associated with the so-called "multiarmed bandit problem" for general Markov-Feller processes is considered. It is shown that this optimal reward function has a simple expression (product form) in terms of individual stopping problems, without requiring any smoothness properties of the optimal reward function, either for the global problem or for the individual stopping problems...
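For intuition only: in the familiar discrete-time discounted bandit, the link between an arm's index and an individual stopping problem is usually written as below; the paper itself works with general Markov-Feller processes in continuous time, so this is a simplified analogue, in my notation:

\nu(x) = \sup_{\tau \ge 1} \frac{\mathbb{E}_x\left[\sum_{t=0}^{\tau-1} \beta^t r(X_t)\right]}{\mathbb{E}_x\left[\sum_{t=0}^{\tau-1} \beta^t\right]},

where \tau ranges over stopping times of the arm's own filtration, \beta \in (0,1) is the discount factor, and r is the arm's reward function.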
Robust Control of the Multi-armed Bandit Problem
We study a robust model of the multi-armed bandit (MAB) problem in which the transition probabilities are ambiguous and belong to subsets of the probability simplex. We first show that for each arm there exists a robust counterpart of the Gittins index that is the solution to a robust optimal stopping-time problem. We then characterize the optimal policy of the robust MAB as a project-by-project...
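Reading this abstract together with the stopping-problem formulation above, a robust counterpart of the index plausibly takes a max-min form over the ambiguity set of transition laws; schematically, in my notation (the paper's exact definition may differ):

\nu_{\mathrm{rob}}(x) = \sup_{\tau \ge 1} \inf_{P \in \mathcal{P}} \frac{\mathbb{E}^{P}_x\left[\sum_{t=0}^{\tau-1} \beta^t r(X_t)\right]}{\mathbb{E}^{P}_x\left[\sum_{t=0}^{\tau-1} \beta^t\right]},

where \mathcal{P} is the subset of the probability simplex containing the ambiguous transition probabilities.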
Robust Contracts in Continuous Time
We study two types of robust contracting problems under hidden action in continuous time. In the type I problem the principal is ambiguous about the project's cash flows, while in the type II problem he is ambiguous about the agent's beliefs. The principal designs a robust contract that maximizes his utility under the worst-case scenario, subject to the agent's incentive and participation constraints. We ...
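Schematically, the contracting problem described here has a max-min structure; in my notation (not the paper's):

\sup_{C} \inf_{\mathbb{Q} \in \mathcal{Q}} \mathbb{E}^{\mathbb{Q}}\left[ U_P(C) \right] \quad \text{subject to the agent's incentive-compatibility and participation constraints,}

where C is the contract, \mathcal{Q} is the set of scenarios the principal entertains (cash-flow models in type I, agent beliefs in type II), and U_P is the principal's payoff.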
Finite-Time Regret Bounds for the Multiarmed Bandit Problem
We show finite-time regret bounds for the multiarmed bandit problem under the assumption that all rewards come from a bounded and fixed range. Our regret bounds after any number T of pulls are of the form a + b log T + c log^2 T, where a, b, and c are positive constants not depending on T. These bounds are shown to hold for variants of the popular ε-greedy and Boltzmann allocation rules, and for a ...
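The ε-greedy variant behind bounds of this form is typically one with decaying exploration. A minimal sketch, assuming Bernoulli rewards in [0, 1] and the schedule ε_t = min(1, cK/(d²t)) from the decaying-ε-greedy literature; the constants c and d below are illustrative, not taken from the paper:

import random

def eps_greedy_bandit(true_means, T, c=5.0, d=0.1):
    # Decaying-epsilon greedy on K Bernoulli arms: explore with
    # probability eps_t = min(1, c*K / (d^2 * t)), otherwise play the
    # arm with the highest empirical mean. Returns total reward.
    K = len(true_means)
    pulls = [0] * K
    means = [0.0] * K
    total = 0.0
    for t in range(1, T + 1):
        eps = min(1.0, c * K / (d * d * t))
        if random.random() < eps:
            arm = random.randrange(K)                    # explore
        else:
            arm = max(range(K), key=lambda i: means[i])  # exploit
        reward = 1.0 if random.random() < true_means[arm] else 0.0
        pulls[arm] += 1
        means[arm] += (reward - means[arm]) / pulls[arm]  # running mean
        total += reward
    return total

# e.g. eps_greedy_bandit([0.2, 0.5, 0.7], T=10000)

With a suitable choice of c relative to the gap d between the best and second-best arms, this schedule is known to give instantaneous regret of order 1/t, and hence logarithmic cumulative regret.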
Journal
Journal title: Economic Theory
Year: 2020
ISSN: 0938-2259, 1432-0479
DOI: 10.1007/s00199-020-01328-3